Abstract
Although high-depth whole genome sequencing (WGS) data from individuals with sickle cell disease (SCD) are increasingly available, easy access to the raw sequencing data remains an issue due to technical and regulatory challenges. A compliant system that provides facile data access would accelerate the discovery of genetic variants associated with clinical phenotypes. Cloud storage and computing offer a practical solution to this data access problem, as we have shown through the St. Jude Cloud (https://stjude.cloud), where over 5000 whole genome sequences from pediatric cancer patients are being shared in collaboration with DNAnexus and Microsoft. Here we expand the St. Jude Cloud to sickle cell disease data through the Sickle Genome Project (SGP) Data Portal (https://pecan.stjude.org/permalink/sgp) to allow instantaneous raw data access (following data access committee approval), as well as visualization of genotype calls at the individual level in a novel genome browser. The SGP WGS data were generated from 871 patients from St. Jude Children's Research Hospital (St. Jude) through the Sickle Cell Clinical Research and Intervention Program (SCCRIP, Pediatr Blood Cancer. 2018 May 24: e27228) and from Baylor College of Medicine (BCM). All study participants provided informed consent for genomic study and data sharing on IRB-approved research protocols.
The SGP data portal will have multi-tiered access. All users will have access to a general heat map view that shows anonymized patient clinical values (e.g., fetal hemoglobin (HbF), mean corpuscular volume (MCV), hemoglobin concentration (Hb)) and relevant SCD-modifying variants (e.g., beta-globin locus, MYB, BCL11A, HBA). The GenomePaint browser allows for viewing coding and noncoding variants. Each variant will be displayed with a visual indication of the median fetal hemoglobin values for patients who are homozygous for the reference allele, heterozygous, or homozygous for the alternative allele. The browser also displays erythroid-specific DNA accessibility and epigenetic marks and indicates variants that may disrupt erythroid-specific transcription factor binding sites (GATA1 and BCL11A). For anonymization purposes, within the genome browser and heat map views, clinical values and the patient's age will be binned into ranges when they would otherwise be displayed as single or low-count values. Lastly, the ProteinPaint tool (Zhang and Zhou, Nature Gen, Dec 29, 2015) will enable visualization and filtering of variants with reference to protein domain and amino acid sequence.
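The two display rules described above, a median HbF per genotype group and binning of low-count values, can be illustrated with a minimal sketch. This is not the portal's implementation: the function names, the genotype encoding as an alternative-allele count (0, 1, 2), the bin width, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical encoding: number of alternative alleles carried by a patient.
GENOTYPE_LABELS = {0: "hom_ref", 1: "het", 2: "hom_alt"}


def median_hbf_by_genotype(records):
    """Return the median HbF value for each genotype group.

    `records` is an iterable of (alt_allele_count, hbf_percent) pairs,
    one pair per patient at a single variant.
    """
    groups = defaultdict(list)
    for alt_count, hbf in records:
        groups[GENOTYPE_LABELS[alt_count]].append(hbf)
    return {label: median(values) for label, values in groups.items()}


def bin_value(value, width):
    """Return the half-open range [low, high) of size `width` containing `value`."""
    low = (value // width) * width
    return (low, low + width)


def anonymize(values, width, min_count=5):
    """Keep a value as-is only when at least `min_count` patients share it.

    Values seen fewer than `min_count` times are replaced by their range.
    The threshold is an assumed parameter, not one stated in the abstract.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else bin_value(v, width) for v in values]
```

For example, under these assumptions `anonymize([7, 7, 7, 12], width=5, min_count=3)` keeps the three shared ages and replaces the single value 12 with the range (10, 15).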
To access processed data such as BAM and VCF files for downstream analyses, a user will be required to apply for access, which will be adjudicated by a data access committee. Verified researchers will be granted access to clinical data in a manner consistent with the protocol-specific informed consent documentation and the protocol under which the sequencing was performed. This may include coded clinical and demographic data when specified by the research protocol and informed consent.
The SGP data set will be one of the first WGS datasets from primarily African American sickle cell patients to be made available to clinicians and researchers worldwide. In addition, no existing SCD-centric data portal offers controlled access to data together with graphical tools for visual analysis. The combination of the visual tools and the ability to download data provides the scientific community with an invaluable resource for studying sickle cell disease.
Estepp: Daiichi Sankyo: Consultancy; NHLBI: Research Funding; Global Blood Therapeutics: Consultancy, Research Funding; ASH Scholar: Research Funding. Hankins: Global Blood Therapeutics: Research Funding; bluebird bio: Consultancy; NCQA: Consultancy; Novartis: Research Funding.